Data reading and writting using Pandas

We will focus on three formats to store our data in disk:

  • CSV: comma separated value. Two standard sparators:
    • comma: american
    • semicolon: eurpean
  • XLS: excel file (xls or xlsx)
  • Pickle: python serialized file format

Pandas provide functions to load csv, xls and pickle files:

Examples

We will see some examples of file loading:

CSV


In [3]:
import pandas as pd

iqsize = pd.read_csv("https://raw.githubusercontent.com/f-guitart/data_mining/master/data/iqsize.csv")
iqsize.head()


Out[3]:
id piq brain height weight sex
0 0 124.0 81.69 64.5 118 Female
1 1 150.0 103.84 73.3 143 Male
2 2 128.0 96.54 68.8 172 Female
3 3 134.0 95.15 65.0 147 Male
4 4 110.0 92.88 69.0 146 Male

In [4]:
type(iqsize)


Out[4]:
pandas.core.frame.DataFrame

In [6]:
iqsize["sex"][:10]


Out[6]:
0    Female
1      Male
2    Female
3      Male
4      Male
5      Male
6    Female
7      Male
8    Female
9      Male
Name: sex, dtype: object

In [9]:
iqsize["sex"].to_csv("myseries.csv")
%ls myseries.csv


 Volume in drive C has no label.
 Volume Serial Number is 2800-91C9

 Directory of C:\Users\Francesc\Documents

24/09/2017  22:26               369 myseries.csv
               1 File(s)            369 bytes
               0 Dir(s)  11.173.228.544 bytes free

Excel

With excel files we can start the other way round: writting first.


In [10]:
iqsize.to_excel("iqsize.xlsx")
%ls iqsize.xlsx


 Volume in drive C has no label.
 Volume Serial Number is 2800-91C9

 Directory of C:\Users\Francesc\Documents

24/09/2017  22:31             7.044 iqsize.xlsx
               1 File(s)          7.044 bytes
               0 Dir(s)  11.176.837.120 bytes free

In [11]:
xls_iqsize = pd.read_excel("iqsize.xlsx")
xls_iqsize.head()


Out[11]:
id piq brain height weight sex
0 0 124.0 81.69 64.5 118 Female
1 1 150.0 103.84 73.3 143 Male
2 2 128.0 96.54 68.8 172 Female
3 3 134.0 95.15 65.0 147 Male
4 4 110.0 92.88 69.0 146 Male

Pickle

Why do we need serialized wirtting features?

Becasue sometimes we do want to store wierd things:

  • DataFrames with dictionaries, lists or objects in columns
  • Dictionaries of dataframes

In [17]:
my_df = pd.DataFrame({"a" : [{"apples": [1,2,3,4,6], "pears":2}, None, None, {"bannanas":4}],
                     "b" : [0,1,2,3]})
my_df.to_csv("mydf.csv")

In [20]:
my_df2 = pd.read_csv("mydf.csv")
type(my_df2.iloc[0,1])


Out[20]:
str

In [29]:
my_df.to_pickle("mydf.pickle")
my_df3 = pd.read_pickle("mydf.pickle")
my_df3.head()


Out[29]:
a b
0 {'apples': [1, 2, 3, 4, 6], 'pears': 2} 0
1 None 1
2 None 2
3 {'bannanas': 4} 3

In [31]:
type(my_df3.iloc[0,0])


Out[31]:
dict

In [23]:
train = pd.Series([1,2,3,4,5,6,7,8])
test = pd.Series([9,10,11])

pd.to_pickle({"train": train,
             "test" : test},"my_pickle.pickle")
%ls my_pickle.pickle


 Volume in drive C has no label.
 Volume Serial Number is 2800-91C9

 Directory of C:\Users\Francesc\Documents

24/09/2017  22:39               818 my_pickle.pickle
               1 File(s)            818 bytes
               0 Dir(s)  11.161.104.384 bytes free

In [24]:
my_pickle = pd.read_pickle("my_pickle.pickle")
my_pickle.keys()


Out[24]:
dict_keys(['train', 'test'])

In [26]:
type(my_pickle['train'])


Out[26]:
pandas.core.series.Series